Time series data


The graphs below don’t have proper titles, axis labels, legends, etc. Please take care to do this on your own graphs.


Structure and basic plots

A time series measures a single variable over many points in time. Time intervals may be regularly or irregularly spaced, but we’ll only consider regularly-spaced data today. We’ll focus on visualizing a single variable over time, since there are already so many choices and information to work with.

For this demo, we’re going to work with a dataset that’s actually already loaded when you start R: It’s defined under co2, and known as the “Mauna Loa Atmospheric CO2 Concentration” dataset. This dataset contains 468 monthly measurements of CO2 concentration from 1959 to 1997.

When you start R, if you type in co2, this is what you’ll see:

co2
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1959 315.42 316.31 316.50 317.56 318.13 318.00 316.39 314.65 313.68 313.18
## 1960 316.27 316.81 317.42 318.87 319.87 319.43 318.01 315.74 314.00 313.68
## 1961 316.73 317.54 318.38 319.31 320.42 319.61 318.42 316.63 314.83 315.16
## 1962 317.78 318.40 319.53 320.42 320.85 320.45 319.45 317.25 316.11 315.27
## 1963 318.58 318.92 319.70 321.22 322.08 321.31 319.58 317.61 316.05 315.83
## 1964 319.41 320.07 320.74 321.40 322.06 321.73 320.27 318.54 316.54 316.71
## 1965 319.27 320.28 320.73 321.97 322.00 321.71 321.05 318.71 317.66 317.14
## 1966 320.46 321.43 322.23 323.54 323.91 323.59 322.24 320.20 318.48 317.94
## 1967 322.17 322.34 322.88 324.25 324.83 323.93 322.38 320.76 319.10 319.24
## 1968 322.40 322.99 323.73 324.86 325.40 325.20 323.98 321.95 320.18 320.09
## 1969 323.83 324.26 325.47 326.50 327.21 326.54 325.72 323.50 322.22 321.62
## 1970 324.89 325.82 326.77 327.97 327.91 327.50 326.18 324.53 322.93 322.90
## 1971 326.01 326.51 327.01 327.62 328.76 328.40 327.20 325.27 323.20 323.40
## 1972 326.60 327.47 327.58 329.56 329.90 328.92 327.88 326.16 324.68 325.04
## 1973 328.37 329.40 330.14 331.33 332.31 331.90 330.70 329.15 327.35 327.02
## 1974 329.18 330.55 331.32 332.48 332.92 332.08 331.01 329.23 327.27 327.21
## 1975 330.23 331.25 331.87 333.14 333.80 333.43 331.73 329.90 328.40 328.17
## 1976 331.58 332.39 333.33 334.41 334.71 334.17 332.89 330.77 329.14 328.78
## 1977 332.75 333.24 334.53 335.90 336.57 336.10 334.76 332.59 331.42 330.98
## 1978 334.80 335.22 336.47 337.59 337.84 337.72 336.37 334.51 332.60 332.38
## 1979 336.05 336.59 337.79 338.71 339.30 339.12 337.56 335.92 333.75 333.70
## 1980 337.84 338.19 339.91 340.60 341.29 341.00 339.39 337.43 335.72 335.84
## 1981 339.06 340.30 341.21 342.33 342.74 342.08 340.32 338.26 336.52 336.68
## 1982 340.57 341.44 342.53 343.39 343.96 343.18 341.88 339.65 337.81 337.69
## 1983 341.20 342.35 342.93 344.77 345.58 345.14 343.81 342.21 339.69 339.82
## 1984 343.52 344.33 345.11 346.88 347.25 346.62 345.22 343.11 340.90 341.18
## 1985 344.79 345.82 347.25 348.17 348.74 348.07 346.38 344.51 342.92 342.62
## 1986 346.11 346.78 347.68 349.37 350.03 349.37 347.76 345.73 344.68 343.99
## 1987 347.84 348.29 349.23 350.80 351.66 351.07 349.33 347.92 346.27 346.18
## 1988 350.25 351.54 352.05 353.41 354.04 353.62 352.22 350.27 348.55 348.72
## 1989 352.60 352.92 353.53 355.26 355.52 354.97 353.75 351.52 349.64 349.83
## 1990 353.50 354.55 355.23 356.04 357.00 356.07 354.67 352.76 350.82 351.04
## 1991 354.59 355.63 357.03 358.48 359.22 358.12 356.06 353.92 352.05 352.11
## 1992 355.88 356.63 357.72 359.07 359.58 359.17 356.94 354.92 352.94 353.23
## 1993 356.63 357.10 358.32 359.41 360.23 359.55 357.53 355.48 353.67 353.95
## 1994 358.34 358.89 359.95 361.25 361.67 360.94 359.55 357.49 355.84 356.00
## 1995 359.98 361.03 361.66 363.48 363.82 363.30 361.94 359.50 358.11 357.80
## 1996 362.09 363.29 364.06 364.76 365.45 365.01 363.70 361.54 359.51 359.65
## 1997 363.23 364.06 364.61 366.40 366.84 365.68 364.52 362.57 360.24 360.83
##         Nov    Dec
## 1959 314.66 315.43
## 1960 314.84 316.03
## 1961 315.94 316.85
## 1962 316.53 317.53
## 1963 316.91 318.20
## 1964 317.53 318.55
## 1965 318.70 319.25
## 1966 319.63 320.87
## 1967 320.56 321.80
## 1968 321.16 322.74
## 1969 322.69 323.95
## 1970 323.85 324.96
## 1971 324.63 325.85
## 1972 326.34 327.39
## 1973 327.99 328.48
## 1974 328.29 329.41
## 1975 329.32 330.59
## 1976 330.14 331.52
## 1977 332.24 333.68
## 1978 333.75 334.78
## 1979 335.12 336.56
## 1980 336.93 338.04
## 1981 338.19 339.44
## 1982 339.09 340.32
## 1983 340.98 342.82
## 1984 342.80 344.04
## 1985 344.06 345.38
## 1986 345.48 346.72
## 1987 347.64 348.78
## 1988 349.91 351.18
## 1989 351.14 352.37
## 1990 352.69 354.07
## 1991 353.64 354.89
## 1992 354.09 355.33
## 1993 355.30 356.78
## 1994 357.59 359.05
## 1995 359.61 360.74
## 1996 360.80 362.38
## 1997 362.49 364.34

These numbers are expressed in parts per million (ppm) - if you’ve taken a chemistry class at CMU (which I haven’t), I assure you know more about this than I do, but ppm is a very common measure for pollutants and contaminants.

The co2 object is actually defined with a class we haven’t seen yet, ts (time series):

class(co2)
## [1] "ts"

As a result, it contains extra attributes about the times associated with each value, and base R graphs can plot this automatically (we don’t need to specify the time range manually):

plot(co2)

This is typically called a line plot. Here’s how you would plot the same data in the gg style (note that you need the package ggfortify to do this) without constructing the typical long-table format we’re used to:

library(ggfortify)
autoplot(co2)

In time series, we are interested in checking for trends (does the variable tend to increase or decrease over time?), seasonality (are there tendencies that regularly occur? If so, at what intervals do they occur?), general variability (i.e., variation beyond average trends and seasonality), and outliers (unusual spikes or valleys). Which of these do we see in the co2 data?

In order to show you more general-purpose plotting methods, we’ll treat co2 as numeric instead of ts from now on. This is actually the form that many time series data take (they’re usually not in the ts format), so this will also show you some of the formatting/structuring issues you’ll have to work with in the wild.

In the code below, we create a dataset with 1:468 as the obs_i variable (x-axis variable) and the CO2 concentration as the co2_val variable (y-axis) variable). Then, we use our usual ggplot to plot the data:

library(tidyverse)
co2_tbl <- tibble(co2_val = as.numeric(co2)) %>%
  mutate(obs_i = 1:n())
co2_tbl %>%
  ggplot(aes(x = obs_i, y = co2_val)) + 
  geom_line() + 
  labs(x = "Time index", y = "CO2 (ppm)")

Note the x-axis: This will not be helpful/informative to readers. It’s a common mistake for people to just put something like “index” on the x-axis, but they don’t even tell you what they are indexing. Be sure to not make this mistake! To properly structure time series data and label it correctly, you’ll need to know how to work with Date type objects in R (which we will discuss next…).

Formatting Date data in R

The following code redefines creates a new column in our dataset to be the monthly dates 1/1/1959, 2/1/1959, … , 12/1/1997 using the as.Date() function (given the description of the time range in help(co2)):

co2_tbl <- co2_tbl %>%
  # We can use the seq() function with dates which is pretty useful!
  mutate(obs_date = seq(as.Date("1959/1/1"), by = "month",
                        length.out = n()))
co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line() + 
  labs(x = "Year", y = "CO2 (ppm)")

Unfortunately, the default format for dates in as.Date() is Year/Month/Day. If you prefer another format, such as the common Month/Day/Year (as I used above in the previous paragraph), you need to include the format argument within as.Date(), as such (note that the Y is capitalized - yes, as.Date() is that picky):

co2_tbl <- co2_tbl %>%
  mutate(obs_date = seq(as.Date("1/1/1959", format = "%m/%d/%Y"), 
                        by = "month",
                        length.out = n()))
co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line() + 
  labs(x = "Year", y = "CO2 (ppm)")

Note that all we needed to do was convert the obs_date variable to a Date class, and then we could use ggplot as is. This is because ggplot knows to use a special date scale for the x-axis when x has class Date. As a result, we can easily play with the breaks on the date axis using scale_x_date(). For example:

  • For a subset of the data, maybe we only want ticks every 4 months, using date_breaks.
  • We can specify the format of the date with date_labels. (See Details section of ?strftime for the formatting options. Here, we choose abbreviated month %b and full year %Y.)
co2_tbl[1:26,] %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line() + 
  scale_x_date(date_breaks = "4 months", date_labels = "%b %Y") +
  labs(x = "Year", y = "CO2 (ppm)") +
  # Modify the x-axis text 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


Moving window averages

One of the things we look for in time series data are trends. But measurements can be quite noisy, making it hard to see the underlying trend. For example: Most investors probably shouldn’t react by selling or buying every time the stock market jumps up or down a little bit, but it might make sense to respond to a longer-term underlying trend.

As we discussed in lecture, moving averages are a great way to get an idea of underlying trends. The ggseas package has a function called stat_rollapplyr() for computing moving averages. Below is an example of computing (and displaying) a moving average using this function:

library(ggseas)
co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color="red") +
  stat_rollapplyr(width = 12, align = "right") +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "Width = 12")

Let’s look at the stat_rollapplyr() line more closely. Setting width = 12 and align = "right" states, “For each point, plot the average of that point and the 11 points behind it.” Since this is monthly data, this is the average of a year’s worth of data. Given this, think about why there is no trend line for the first 11 points on the left-hand side of the graph.

To get a better idea of what happens when we change the width argument, below is an example of setting width = 2 and width = 24.

library(patchwork)
wid2 <- co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color="red") +
  stat_rollapplyr(width = 2, align = "right") +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "Width = 2")

wid24 <- co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color="red") +
  stat_rollapplyr(width = 24, align = "right") +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "Width = 24")
wid2 + wid24

Setting width to be smaller gives us a line that fits the data much better, but it’s also much less smooth. (In fact, setting width = 1 is literally the data itself.) Meanwhile, setting width to be larger gives us a much smoother fit, but it doesn’t fit the data as precisely. Ultimately, choosing how much smoothness you want for your visual is very much an art: You want the trend to be smooth enough that you get an idea of the overall trend, but you also don’t want to deviate from your data too much. Some choices are better than others: For example, below is a trend line whose width is obviously too large (obviously because the trend line doesn’t fit the data well at all).

co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color="red") +
  stat_rollapplyr(width = 100, align = "right") +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "Width = 100")
## Warning: Removed 99 rows containing missing values (`geom_line()`).

One last thing to note is that moving averages are extremely similar to loess smoothing. Moving averages compute the average outcome within a sliding window; meanwhile, loess computes a weighted linear regression within a sliding window. Indeed, loess can be used for understand the overall trend of a time series as well:

moving_ave_plot <- co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color = "red") +
  stat_rollapplyr(width = 12, align = "right") +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "Width = 12")

loess_plot <- co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color = "red") +
  geom_smooth(method = "loess") +
  labs(x = "Year", y = "CO2 (ppm)", title = "Loess")

moving_ave_plot + loess_plot
## Warning: Removed 11 rows containing missing values (`geom_line()`).
## `geom_smooth()` using formula = 'y ~ x'

Moving averages actually literally are loess, but with the formula y ~ 1 instead of y ~ x (i.e., an intercept-only model), and using a rectangular kernel (which is the NOT the default in loess() in R).


Adding lines and rectangles to a time series

If a specific event in time has a notable effect on your time series data, it is common to mark that event with a vertical line, or with rectangular shading if that event occurs over a period of time. For example, it’s common for financial data to denote recessions with gray rectangular shading (this is often called “recession shading”), because economic recessions have notable impacts on financial data.

To add a vertical line to your time series, you just use geom_vline(). For example, here is a plot of the co2_tbl data with a vertical line for the beginning of 1980:

co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color = "red") +
  stat_rollapplyr(width = 12, align = "right") +
  geom_vline(aes(xintercept = as.Date("1980-01-01"))) +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "CO2 Emissions Over Time")

Note that the xintercept has to have a very particular format: It has to be a date that has the same dating format as your data. For example, here is the exact same code, but with the more common Month-Day-Year format for the date; the line will not display (presumably because it thinks I’m trying to add a vertical line for Year 1):

co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color = "red") +
  stat_rollapplyr(width = 12, align = "right") +
  geom_vline(aes(xintercept = as.Date("01-01-1980"))) +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "CO2 Emissions Over Time")

Furthermore, note that you cannot just put an integer (e.g., the year) into geom_vline() when plotting a time series in this way. For example, here is what happens when I try to just input the number 1980 (thinking I’ll draw a line at the year 1980):

co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_line(color = "red") +
  stat_rollapplyr(width = 12, align = "right") +
  geom_vline(aes(xintercept = 1980)) +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "CO2 Emissions Over Time")

Meanwhile, to add rectangular shading to a time series plot, you use geom_rect(), which we haven’t seen before. However, it’s very intuitive: Within aes(), you just need to specify an xmin, xmax, ymin, and ymax. For example, below is our time series with gray shading between 1980 and 1990:

co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) + 
  geom_rect(aes(xmin = as.Date("1980-01-01"), xmax = as.Date("1989-12-31"),
                ymin = -Inf, ymax = Inf),
            fill = "grey", alpha = 0.5) +
  geom_line(color = "red") +
  stat_rollapplyr(width = 12, align = "right") +
  geom_vline(aes(xintercept = as.Date("01-01-1980"))) +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "CO2 Emissions Over Time")

There are just a few important things to keep in mind when using rectangular shading on time series:

co2_tbl %>%
  ggplot(aes(x = obs_date, y = co2_val)) +
  geom_line(color = "red") +
  stat_rollapplyr(width = 12, align = "right") +
  geom_rect(aes(xmin = as.Date("1980-01-01"), xmax = as.Date("1989-12-31"),
                ymin = -Inf, ymax = Inf),
            fill = "grey", alpha = 0.5) +
  geom_vline(aes(xintercept = as.Date("01-01-1980"))) +
  labs(x = "Year", y = "CO2 (ppm)", 
       title = "CO2 Emissions Over Time")

Lags and autocorrelation functions

Now we will walk through how to compute autocorrelations using lags.

Imagine computing the autocorrelation at each lag (1, 2, 3, …). For example, the autocorrelation at lag 1 is the correlation between the time series itself and the time series shifted back one timepoint. Computing the autocorrelation at each lag will thus give us a vector of autocorrelations, commonly called the autocorrelation function (ACF).

You can think of the ACF as taking in a lag \(\ell\) and spitting out a correlation between -1 and 1. The higher the correlation, the more dependent the time series measurements are on each other; and the larger the \(\ell\) with high correlation, the more this dependence “lasts” over time.

Base R has the acf() function which can be used to compute and plot the ACF for a large number of lags.

acf(co2_tbl$co2_val)

And the ggplot version is:

auto_corr <- acf(co2_tbl$co2_val, plot = FALSE)
autoplot(auto_corr)

By default, these plots display 95% confidence intervals, where autocorrelations outside this confidence interval are deemed “unusually large.” These intervals will depend on the data itself (for example, in another application, the intervals may not be \(\pm 0.1\)), and the way these intervals are computed is outside the scope of this class (but it is easily Googleable for those interested; see the end of this document for example).

Because the autocorrelations are way above the confidence interval, we would say that this visual clearly shows that the co2 data is highly autocorrelated, due to the strong global trend. What if we remove the trend, then look at the ACF of the residuals? We can compute moving averages using rollapply (in the zoo package, loaded when you load ggseas).

# Use rollapply to compute a Moving Average, then compute residuals (co2_val - mov_ave)
co2_tbl <- co2_tbl %>%
  mutate(mov_ave = 
           zoo::rollapply(co2_val, width = 12, FUN = "mean", 
                          align = "right", fill = NA),
         res = co2_val - mov_ave)
# Just to understand what this is doing, let's look
# at the first few rows of the data:
co2_tbl[1:13,]
## # A tibble: 13 × 5
##    co2_val obs_i obs_date   mov_ave    res
##      <dbl> <int> <date>       <dbl>  <dbl>
##  1    315.     1 1959-01-01     NA  NA    
##  2    316.     2 1959-02-01     NA  NA    
##  3    316.     3 1959-03-01     NA  NA    
##  4    318.     4 1959-04-01     NA  NA    
##  5    318.     5 1959-05-01     NA  NA    
##  6    318      6 1959-06-01     NA  NA    
##  7    316.     7 1959-07-01     NA  NA    
##  8    315.     8 1959-08-01     NA  NA    
##  9    314.     9 1959-09-01     NA  NA    
## 10    313.    10 1959-10-01     NA  NA    
## 11    315.    11 1959-11-01     NA  NA    
## 12    315.    12 1959-12-01    316. -0.396
## 13    316.    13 1960-01-01    316.  0.373

Just to show that this is indeed the moving average:

co2_tbl %>%
  ggplot(aes(x = obs_date)) +
  geom_line(aes(y = mov_ave)) +
  geom_line(aes(y = co2_val), color = "red") +
  labs(x = "Year", y = "CO2 (ppm)")
## Warning: Removed 11 rows containing missing values (`geom_line()`).

Now let’s look at a plot of the residuals (i.e., the observed outcomes minus the moving average), as well as an autocorrelation plot of the residuals:

# Plot the residuals
co2_tbl %>%
  ggplot(aes(x = obs_date, y = res)) +
  geom_line() +
  labs(x = "Year", y = "Residuals of CO2 (ppm)")
## Warning: Removed 11 rows containing missing values (`geom_line()`).

# Compute ACF of the residuals
# (except the first 11 cases, which are NA)
autoplot(acf(tail(co2_tbl$res, -11), plot = FALSE))

Now what pattern do we see in the ACF plot? Which lags have the most positive and most negative values? What does this mean for monthly data?

See this link for tips on mimicking the acf() plot in ggplot. For example, you can save the output of acf() and then put it inside a data frame:

co2_acf <- acf(tail(co2_tbl$res, -11))
co2_acf_df <- with(co2_acf, data.frame(lag, acf))
# the resulting data.frame looks like this:
head(co2_acf_df)
##   lag         acf
## 1   0  1.00000000
## 2   1  0.83287000
## 3   2  0.42898732
## 4   3 -0.05635883
## 5   4 -0.47308749
## 6   5 -0.73412921
# Then, we can make a simple "bar" plot:
# (bar is in quotes because it's not displaying a categorical variable)
ggplot(co2_acf_df, aes(x = lag, y = acf)) + geom_bar(stat = "identity")

# Alternatively, you could use:
# ggplot(co2_acf_df, aes(x = lag, y = acf)) + geom_col()
# which does the same thing.

Seasonal decomposition

Recall that, in lecture, we mentioned that there are three main characteristics of time series:

The idea above shows us that we can remove global trends from the dataset and then look for seasonal trends. The base R function stl() does this explicitly using loess smoothers:

Seasonal decomposition of Time series by Loess”. It actually also separates the seasonal trend from the irregular / residual noise.

In ggseas, we can mimic this by using ggsdc() instead of starting with the ggplot() function. (The argument s.window is a kind of loess smoother span for the seasonal extraction.)

co2_tbl %>%
  ggsdc(aes(obs_date, co2_val), 
        frequency = 12, method = "stl", s.window = 12) + 
  geom_line() +
  labs(x = "Year", y = "CO2 (ppm)")

The sum of “trend” + “seasonal” + “irregular” gives us back the original data, shown in “observed” (the top facet).

Does it look like stl found all interesting global or periodic trends, or is there any signal left in the “irregular” component?

Here is the intuition for how the stl() function works:

This is very technically not 100% correct, but it’s very, very close to what actually happens within the stl() function; see help(stl) for details.